Raymond Walters lectures: https://www.youtube.com/watch?v=dVrF0l9jMgE and https://www.youtube.com/watch?v=QVPNouAbXsY
Original LD score regression paper: LD Score regression distinguishes confounding from polygenicity in genome-wide association studies
Stratified LD score regression paper: Partitioning heritability by functional annotation using genome-wide association summary statistics
Genetic correlation (cross-trait LD score regression) paper: An atlas of genetic correlations across human diseases and traits
In genetics, the standard additive model is
\[\begin{equation} \tilde{y_i}=\sum_{j=1}^J \beta_jx_{ij} +\epsilon_i \end{equation}\]
where \(\tilde{y_i}\) is the (standardised) phenotype of interest for individual \(i\), \(x_{ij}\) is the genotype of individual \(i\) at SNP \(j\), \(\beta_j\) is the effect size of SNP \(j\) on the phenotype and \(\epsilon_i\) is the residual.
The data are typically standardised so that \(var(\tilde{y_i})=1\) and \(var(x_{ij})=1\) for every SNP. Standardising the genotypes implicitly assumes a relationship between \(\beta_j\) and minor allele frequency (MAF). There are two extremes: (i) standardised genotypes with a constant variance for \(\beta\), so that the variance explained by each SNP is the same and rarer variants (smaller MAF) must have a larger per-allele effect size to compensate; and (ii) no standardisation, so that the distribution of per-allele effect sizes is the same for every SNP and does not depend on allele frequency. Realistically, the truth lies somewhere between these extremes and is trait-specific.
In the genome, SNPs are correlated with one another, so a GWAS (which tests one SNP at a time) estimates marginal rather than joint effects,
\[\begin{equation} \hat{\beta_j}^{GWAS}=s_j+\sum_{k=1}^J \beta_k r_{x_{i,j},x_{i,k}}+\epsilon_j \end{equation}\]
where \(s_j\) is some bias from confounders (e.g. population stratification or relatedness) and \(r_{x_{i,j},x_{i,k}}\) is the correlation between SNPs \(x_j\) and \(x_k\).
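The marginal-effect equation can be checked numerically. The following is a minimal sketch, assuming standardised genotypes, no confounding (\(s_j=0\)) and a noise-free phenotype (\(\epsilon=0\)), so that the marginal estimates equal \(\sum_k \beta_k r_{jk}\) exactly. The sample size, number of SNPs, LD structure and effect sizes are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, J = 500, 20  # hypothetical sample size and number of SNPs

# Toy "genotypes": each column mixes in its left neighbour, giving LD that
# decays with distance (an assumption made purely for illustration).
Z = rng.standard_normal((N, J))
X = Z.copy()
for j in range(1, J):
    X[:, j] = 0.6 * X[:, j - 1] + 0.8 * Z[:, j]

# Standardise each SNP to mean 0 and variance 1.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Two causal SNPs; every other SNP has a true effect of zero.
beta = np.zeros(J)
beta[[3, 12]] = [0.3, -0.2]
y = X @ beta  # noise-free phenotype

R = X.T @ X / N              # sample correlation (LD) matrix
beta_marginal = X.T @ y / N  # one-SNP-at-a-time regression estimates

# Marginal effects are the true effects smeared through the LD matrix.
print(np.allclose(beta_marginal, R @ beta))
```

SNP 4 has no causal effect of its own, but its marginal estimate is non-zero because it is in LD with causal SNP 3.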
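For each SNP, we calculate a \(\chi^2\) association statistic. With standardised data this is approximately \(N(\hat{\beta_j}^{GWAS})^2\), so it reflects the squared marginal effect estimate rather than the effect size itself. If we define the LD score of SNP \(j\) as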
\[\begin{equation} l_j=\sum_{k=1}^J r^2_{x_{i,j}, x_{i,k}} \end{equation}\]
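As a rough illustration of this definition, the sketch below computes LD scores from a small genotype matrix by summing squared correlations over all SNPs, including the SNP itself. The function name `ld_scores` and the toy data are assumptions for illustration; in practice LD scores are estimated from a reference panel and the sum is restricted to a window around each SNP.

```python
import numpy as np

def ld_scores(X):
    """Return l_j = sum_k r_jk^2 for a genotype matrix X (individuals x SNPs)."""
    R = np.corrcoef(X, rowvar=False)  # SNP-by-SNP correlation matrix
    return (R ** 2).sum(axis=1)

rng = np.random.default_rng(1)
N = 2000  # hypothetical reference panel size

# Three toy SNPs: the first two share a common component (expected r^2 = 0.64),
# the third is independent of both.
shared = rng.standard_normal(N)
X = np.column_stack([
    shared + 0.5 * rng.standard_normal(N),
    shared + 0.5 * rng.standard_normal(N),
    rng.standard_normal(N),
])

l = ld_scores(X)
print(l)  # the two linked SNPs get higher LD scores than the independent one
```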
then our expected \(\chi^2\) can be shown to equal
\[\begin{equation} E(\chi^2_j)=1+N\alpha+\dfrac{Nh^2_{SNP}}{M} l_j \end{equation}\]
where \(N\) is the sample size, \(\alpha\) is a measure of confounding and \(M\) is the number of SNPs. This relationship between \(\chi^2\) value and LD score is intuitive: the more variants a SNP tags (and the more strongly it tags them), the more likely it is to tag a causal variant (CV). More formally, “assuming a uniform prior, we see SNPs with more LD friends showing more association”. Note that \(h^2_{SNP}=\sum_j\beta_j^2\).
So, if we regress our \(\chi^2\) values from the GWAS on \(Nl_j\) for each SNP \(j\), we get:
Intercept: estimate of \(1+N\alpha\). Testing for deviation from 1 gives an index of stratification/confounding, and the intercept can be used to correct for it; a value \(>1\) implies confounding, similar to genomic control.
Slope: estimate of \(\frac{h^2_{SNP}}{M}\) (with known \(M\), this converts to an estimate of \(h^2_{SNP}\)), i.e. how strongly \(\chi^2\) tracks with changes in LD score.
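The regression described above can be sketched on simulated data. This is a minimal illustration, assuming \(\chi^2\) statistics drawn directly from the model (a scaled 1-degree-of-freedom \(\chi^2\) with the expected value given earlier) and toy LD scores drawn from a gamma distribution; every parameter value is hypothetical. It uses plain least squares, whereas the published method additionally uses regression weights and a block jackknife for standard errors.

```python
import numpy as np

rng = np.random.default_rng(42)
M = 200_000        # number of SNPs (hypothetical)
N = 50_000         # GWAS sample size (hypothetical)
h2_true = 0.4      # SNP heritability used to simulate
alpha_true = 2e-6  # confounding term, so that N * alpha = 0.1

# Toy LD scores (every SNP tags at least itself, so l_j >= 1).
l = 1.0 + rng.gamma(shape=2.0, scale=10.0, size=M)

# Draw chi-square statistics whose expectation follows the LD score model.
expected_chi2 = 1.0 + N * alpha_true + N * h2_true / M * l
chi2 = expected_chi2 * rng.standard_normal(M) ** 2

# Regress chi2 on N * l_j: slope estimates h2 / M, intercept estimates 1 + N * alpha.
slope, intercept = np.polyfit(N * l, chi2, deg=1)
h2_hat = slope * M

print(f"intercept = {intercept:.3f}, h2 estimate = {h2_hat:.3f}")
```

With these settings the estimates should land close to the simulated values (intercept 1.1, \(h^2_{SNP}\) 0.4), up to sampling noise.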
This method was first used to distinguish between population stratification (where there will be no relationship between LD score and \(\chi^2\) association statistic) and genuinely interesting polygenic effects (where there will be a positive relationship between LD score and \(\chi^2\) association statistic) by examining the LD score regression intercept. The intercept was compared with \(\lambda_{GC}\) (the factor by which the observed \(\chi^2\) values are divided in the genomic-control method) to show that genomic control is unnecessarily conservative (LD score intercept \(<\lambda_{GC}\)).
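To see why genomic control can be overly conservative, the sketch below simulates a purely polygenic trait with no confounding and compares \(\lambda_{GC}\) (here taken as the median observed \(\chi^2\) divided by the median of a 1-degree-of-freedom \(\chi^2\) distribution) with the LD score regression intercept. The simulation set-up and all parameter values are assumptions for illustration.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(7)
M, N = 200_000, 50_000  # hypothetical numbers of SNPs and samples
h2_true = 0.4           # polygenic signal only: no confounding (alpha = 0)

l = 1.0 + rng.gamma(shape=2.0, scale=10.0, size=M)  # toy LD scores
expected_chi2 = 1.0 + N * h2_true / M * l
chi2 = expected_chi2 * rng.standard_normal(M) ** 2

# Median of a chi-square(1) distribution, roughly 0.455.
median_chi2_null = NormalDist().inv_cdf(0.75) ** 2
lambda_gc = np.median(chi2) / median_chi2_null

# LD score regression intercept from an unweighted fit.
slope, intercept = np.polyfit(N * l, chi2, deg=1)

print(f"lambda_GC = {lambda_gc:.2f}, LD score intercept = {intercept:.2f}")
```

Here \(\lambda_{GC}\) is well above 1 even though there is no confounding, so dividing by it would remove real signal, while the intercept stays near 1.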
LD score regression was developed as a tool to distinguish confounding from polygenicity in GWAS using only summary statistics.
Its development was based on the fact that \(\chi^2\) values for true associations are positively correlated with LD scores, whereas \(\chi^2\) values for false positives (e.g. due to population stratification/drift) are not correlated with LD scores.
The intercept of the regression of \(\chi^2\) on LD score estimates confounding (\(=1\) if there is no confounding), similarly to (but arguably better than) \(\lambda_{GC}\).
An extension of LD score regression is stratified LD score regression, which aims to partition heritability by functional annotation.
We have previously assumed that
\[\begin{equation} Var(\beta_j)=\dfrac{h^2_{SNP}}{M} \end{equation}\]
i.e. that heritability from each SNP is on average the same genome wide. But what if we want to evaluate whether there are regions of the genome with stronger effects (i.e. higher \(Var(\beta_j)\))?
To do this, we allow the variance to vary between functional categories (\(C\)),
\[\begin{equation} Var(\beta_j)=\sum_{c:j\in C_c}\tau_c \end{equation}\]
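With disjoint categories, each SNP belongs to exactly one category and
\[\begin{equation} h^2_{SNP}(C_c)=\sum_{j\in C_c}\beta_j^2=\tau_c\times M(C_c) \end{equation}\]
where \(M(C_c)\) is the number of SNPs in category \(c\). Otherwise, we are assuming that overlapping categories act additively on the total variance, e.g. a SNP in both categories 1 and 2 has \(Var(\beta_j)=\tau_1+\tau_2\).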
The stratified LD score model now looks like,
\[\begin{equation} E(\chi^2_j)=1+N\alpha+N\sum_c \tau_c l_{j,c} \end{equation}\]
where \(c\) indexes the functional categories and \(l_{j,c}=\sum_{k\in C_c} r^2_{x_{i,j}, x_{i,k}}\) is the LD score of SNP \(j\) with respect to category \(c\). That is, rather than summing over all LD friends, we now sum over the LD friends that are also in functional category \(c\). We can estimate \(\tau_c\) via multiple regression, with \(l_{j,c}\) computed from reference data for a choice of annotation, where \(\tau_c\) is the per-SNP contribution to heritability of category \(c\).
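The multiple regression step can be sketched as follows. This is a toy illustration, assuming two overlapping categories (a baseline category containing every SNP and a hypothetical functional category), category-specific LD scores drawn from made-up distributions, and \(\chi^2\) statistics simulated directly from the stratified model. It uses unweighted least squares rather than the weighting used in the published method.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 200_000, 50_000             # hypothetical numbers of SNPs and samples
tau_true = np.array([1e-6, 8e-6])  # per-SNP variance: baseline, functional
alpha_true = 0.0                   # no confounding in this toy example

# Toy category-specific LD scores. The baseline category contains every SNP,
# so its LD score is the ordinary one; the functional category tags a random
# fraction of each SNP's LD friends.
l_all = 1.0 + rng.gamma(shape=2.0, scale=10.0, size=M)
l_func = rng.beta(1.0, 9.0, size=M) * l_all
L = np.column_stack([l_all, l_func])

# Simulate chi-square statistics from the stratified LD score model.
expected_chi2 = 1.0 + N * alpha_true + N * (L @ tau_true)
chi2 = expected_chi2 * rng.standard_normal(M) ** 2

# Multiple regression of chi2 on [1, l_{j,1}, l_{j,2}]; the fitted
# coefficients estimate N * tau_c, so divide by N to recover tau_c.
design = np.column_stack([np.ones(M), L])
coef, *_ = np.linalg.lstsq(design, chi2, rcond=None)
intercept = coef[0]
tau_hat = coef[1:] / N

print("intercept:", round(float(intercept), 3))
print("tau estimates:", tau_hat)
```

The functional category should come out with a clearly larger \(\tau_c\) than the baseline, matching the values used in the simulation up to sampling noise.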
There are two ways to evaluate partitioned heritability results:
Full derivations can be found here.
It is often useful to define buffer regions around annotations. For example, rather than only recording a binary 0/1 for whether the SNP falls in an annotation, it may be important to know whether a SNP lies very close to the boundary of that annotation. For this reason, additional annotations can be defined for SNPs falling in these buffer regions (i.e. the model includes each annotation as well as a widened version of it that adds the buffer region).
Can extend to continuous annotations (rather than 0/1 whether it is in the annotation or not; https://www.nature.com/articles/ng.3954).
Used to make statements like “variants for BMI are enriched in regions that suggest active marks in CNS cells”.
Only requires summary statistics.
Does not assume a single CV per region.
Does not only use SNPs either reaching genome-wide significance or falling in genome-wide significant regions.
Accounts for LD.
Computationally efficient.
Requires large data sets and/or large SNP heritability.
Trait analysed must be polygenic.
Requires an LD reference panel matched to the population studied.
Not applicable to studies using custom genotyping arrays (because the LD scores are computed from 1000 Genomes data and need to generalise to the study SNPs).
Based on an additive model and does not consider non-additive effects.